Recently, two new parallel algorithms for on-the-fly model checking of LTL properties were presented
at the same conference: Automated Technology for Verification and Analysis, 2011. Both approaches
extend Swarmed NDFS, which runs several sequential NDFS instances in parallel. While parallel random
search already speeds up detection of bugs, the workers must share some global information in order
to speed up full verification of correct models. The two algorithms differ considerably in the
global information shared between workers, and in the way they synchronize.

Here, we provide a thorough experimental comparison between the two algorithms, by measuring 
the runtime of their implementations on a multi-core machine. Both algorithms were implemented 
in the same framework of the model checker LTSmin, using similar optimizations, and have been 
subjected to the full Beem model database. 

Because both algorithms have complementary advantages, we constructed an algorithm that combines
both ideas. This combination clearly has an improved speedup. We also compare the results with the
alternative parallel algorithm for accepting cycle detection, Owcty-Map. Finally, we study a simple
statistical model for input models that do contain accepting cycles. The goal is to distinguish the
speedup due to parallel random search from the speedup that can be attributed to clever work
sharing schemes.

1 Introduction 

Model checking is an important technique to automatically verify that a system's behavior is free
from subtle bugs, for instance violations of safety and liveness requirements. Linear Temporal Logic
(LTL) expresses such requirements as properties on individual runs of a system. LTL model checking
reduces to detecting accepting cycles in a so-called Büchi automaton [20]. A linear-time algorithm
to detect those cycles is the Nested Depth-First Search (NDFS) algorithm, introduced by Courcoubetis
et al. [6]. NDFS can also terminate as soon as some accepting cycle is found, which makes it very
useful for bug hunting. Still, model checking is a time- and memory-consuming procedure due to the
sheer size of the state space of realistic systems, leading to an extremely large Büchi automaton.

During the last decades, processor speeds have increased greatly, making model checkers much more
powerful. Where early papers discussed the verification of models with a few thousand states,
currently we can easily handle billions of states [16, 15, 14]. Recently, however, these advances
are grinding to a halt because of physical limits inside the CPU cores. Instead, the number of
logical computing cores increases. Nonetheless, model checking can still benefit from the progress
made by CPU manufacturers, if the algorithms are parallelized.

A complication is that DFS (and thus NDFS) is inherently sequential [18]. Barnat et al. have
therefore introduced breadth-first search (BFS) based algorithms, such as
Maximal-Accepting-Predecessors (Map [5]) and One-Way-Catch-Them-Young (OwCTY [21]). These algorithms
deliver excellent speedups, but sacrifice linear-time complexity. However, their latest combined
Owcty-Map algorithm [4] is linear-time for the class of weak LTL properties and also useful for bug
hunting. It is therefore the current state of the art in multi-core LTL model checking.

Recently, two parallel NDFS-based algorithms were also introduced [7, 13]. Both take as starting
point a randomized parallel search by a swarm of NDFS workers. While this is useful for bug hunting,
it does not really help in the absence of bugs, in which case all workers traverse the full state
space. To improve speedup, both algorithms share some global information between workers, in order
to reduce the amount of work even in the absence of accepting cycles. ENDFS from Evangelista et
al. [7] shares a lot of information, but this may break the required DFS order. A sequential repair
procedure steps in when a potentially dangerous situation is detected. On the other hand, LNDFS from
Laarman et al. [13] shares less global information and adds extra synchronization. This avoids
dangerous situations and the need for a repair strategy. However, this leads to a reduced amount of
work sharing in some cases.

Contributions. The main goal of this paper is to experimentally compare both multi-core NDFS
algorithms. In order to enable a fair comparison, we extended ENDFS with the same optimizations as
used in LNDFS. We implemented both algorithms in the same framework of LTSmin. Finally, we subjected
both implementations to the full Beem benchmark database y/TJ, running them on shared-memory
machines with up to 16 cores. Note that actual runtimes had not yet been reported for ENDFS,
although workload distributions were shown in [7]. Also, for LNDFS, we have rerun the experiments
from [13].

Another contribution is a simple combination of the ENDFS and LNDFS algorithms, improving the
speedup compared to both of them. We also compare all mentioned algorithms with the Owcty-Map
algorithm, both for bug hunting and for full verification. Finally, based on a simple statistical
model [12], we investigate how much of the speedup in the parallel NDFS algorithms should be
attributed to the effects of parallel random search, and how much to the more clever work sharing
schemes.

The algorithms are explained in Section 2. The experimental results are presented in Section 3.
Section 4 contains the discussion on parallel random search. Our conclusions are summarized in
Section 5.

2 Parallel Algorithms to Detect Accepting Cycles 

Model checking properties from Linear Temporal Logic (LTL) entails verifying that all runs of a given
system satisfy some safety or liveness property. In the automata-theoretic approach EOl [Til, a Büchi
automaton is constructed that accepts all infinite words corresponding to those runs of the original
system that violate the property. So the problem is reduced to the emptiness check of ω-regular
languages. A Büchi automaton accepts a word if it visits some accepting state infinitely often. For
finite automata, this implies that there is a cycle through some accepting state.

Definition 1 A Büchi automaton is a quadruple $\mathcal{B} = (\mathcal{S}, s_I, \text{post}, \mathcal{A})$,
where $\mathcal{S}$ is the finite set of states, $s_I \in \mathcal{S}$ is the initial state,
$\text{post} : \mathcal{S} \to 2^{\mathcal{S}}$ the successor function, and
$\mathcal{A} \subseteq \mathcal{S}$ the set of accepting states.

Note that the use of the post function reflects the way in which the Büchi automaton is computed
on-the-fly from the input model. When appropriate, we refer to the complete automaton as graph or
state space.

The purpose of all algorithms in this paper is to detect an accepting cycle in this graph. For states
$s, t \in \mathcal{S}$, we write $s \to t$ if $t \in \text{post}(s)$, and $\to^+$ ($\to^*$) for its
(reflexive) transitive closure. An accepting cycle is some state $a \in \mathcal{A}$ that is
reachable from the initial state ($s_I \to^* a$) and lies on a non-trivial cycle ($a \to^+ a$).
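
As an illustration of this definition, the following Python sketch checks for an accepting cycle directly from the two reachability conditions. It is a naive reference check, not one of the algorithms studied in this paper: it can take quadratic time, and the function names and the dictionary-based graph encoding are assumptions made for the example.

```python
def reach_plus(post, s):
    """Return all states t with s ->+ t (reachable in one or more steps)."""
    seen, stack = set(), list(post(s))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(post(t))
    return seen


def has_accepting_cycle(post, accepting, init):
    """True iff some accepting a satisfies init ->* a and a ->+ a."""
    reach_star = reach_plus(post, init) | {init}
    return any(a in reach_plus(post, a) for a in reach_star if a in accepting)


# Hypothetical example graph: 0 -> 1 -> 2 -> 1, with state 1 accepting.
graph = {0: [1], 1: [2], 2: [1]}
post = lambda s: graph.get(s, [])
result = has_accepting_cycle(post, {1}, 0)
```

Such a brute-force check is mainly useful as an oracle when testing a linear-time algorithm on small graphs.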

   1  proc ndfs(s)
   2    dfs_blue(s)
   3    report no cycle

   4  proc dfs_red(s)
   5    s.color := red
   6    for all t in post(s) do
   7      if t.color = cyan
   8        report cycle & exit
   9      else if t.color = blue
  10        dfs_red(t)

  11  proc dfs_blue(s)
  12    s.color := cyan
  13    for all t in post(s) do
  14      if t.color = cyan and (s ∈ A ∨ t ∈ A)
  15        report cycle & exit
  16      if t.color = white
  17        dfs_blue(t)
  18    if s ∈ A
  19      dfs_red(s)
  20    else
  21      s.color := blue

Figure 1: The (sequential) New NDFS algorithm, adapted from [19].



2.1 Nested Depth-First Search 

The first linear-time algorithm to detect accepting cycles was proposed by Courcoubetis et al. [6]
and is referred to as Nested Depth-First Search (NDFS). NDFS also enjoys the on-the-fly property.
This means that the algorithm can terminate as soon as a cycle is detected, without the need to
visit (or even construct) the whole graph. This makes NDFS very suitable for bug hunting, besides
its use for full verification. Various extensions and optimizations to NDFS have been proposed
[11, 19, 9]. Alg. 1 most closely resembles New NDFS [19].

In Alg. 1, ndfs($s_I$) initiates a blue DFS from the initial state, so called since explored states
are colored blue (we assume that initially all states are white). A newly visited state is first
colored cyan ("it is on the DFS-stack"), and during backtracking after exploration, it is colored
full blue. However, if at l.18 the blue DFS backtracks over an accepting state $s \in \mathcal{A}$,
then dfs_red(s) is called, which is the nested red DFS to determine whether there exists a cycle
containing s. As soon as a cyan state is found at l.7, an accepting cycle is reported [11, 19].
Early cycle detection is also possible in the blue DFS at l.14-15. Due to early cycle detection, it
does not matter that the cyan color of s is overwritten by red at l.5 [13, Sect. 4.4].

Ndfs runs in linear time, since each reachable state is visited at most twice, once in the blue DFS 
and once in a red DFS. The correctness of NDFS essentially depends on the fact that the red Dfss are 
initiated on accepting states in the post order imposed by the blue DFS. So the red search will never hit 
another accepting state that is not already red. 
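
The color discipline of Alg. 1 can be sketched in Python as follows. This is a minimal illustration on an explicit successor function, not the LTSmin implementation; the names `ndfs`, `post` and `accepting`, and the use of an exception for "report cycle & exit", are assumptions of this example. Because it is recursive, it is only suitable for small graphs.

```python
class CycleFound(Exception):
    """Raised to implement 'report cycle & exit'."""


def ndfs(post, accepting, init):
    """Return True iff an accepting cycle is reachable from init."""
    color = {}  # a missing key means the state is still white

    def dfs_red(s):
        color[s] = "red"
        for t in post(s):
            c = color.get(t, "white")
            if c == "cyan":          # closed a cycle through the DFS stack
                raise CycleFound
            elif c == "blue":        # continue only inside the blue region
                dfs_red(t)

    def dfs_blue(s):
        color[s] = "cyan"            # s is on the blue DFS stack
        for t in post(s):
            c = color.get(t, "white")
            if c == "cyan" and (s in accepting or t in accepting):
                raise CycleFound     # early cycle detection
            if c == "white":
                dfs_blue(t)
        if s in accepting:
            dfs_red(s)               # nested search, started in post order
        else:
            color[s] = "blue"

    try:
        dfs_blue(init)
    except CycleFound:
        return True
    return False


# Hypothetical example: 0 -> 1 -> 2 -> 0, where only state 1 is accepting.
graph = {0: [1], 1: [2], 2: [0]}
found = ndfs(lambda s: graph.get(s, []), {1}, 0)
```

In this example the cycle is only found by the red search: the blue DFS closes the loop at the edge from 2 to 0, where neither endpoint is accepting.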



2.2 Embarrassing Parallelization: Swarmed NDFS 

The inherently DFS nature of the blue search makes NDFS hard to parallelize, since computing the
post order is a P-complete problem [18]. One response has been to develop entirely different
algorithms based on Breadth-First Search, cf. Sec. 2.6.

Another approach would be to simply run N isolated instances of NDFS (Alg. 1) in parallel, in the
hope that this swarm of NDFS workers will detect accepting cycles earlier [10, 13]. Local
permutations of the post function direct the workers to different regions of the state space, so
their search becomes independent. With $\text{post}_i^b$ ($\text{post}_i^r$) we denote the
permutation of successors used in the blue (red) DFS by worker $i$. Section 4 analyses the expected
and actual improvements due to parallel randomized search.
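
One possible way to obtain such worker-local permutations is sketched below in Python. The wrapper `permuted_post` and its seeding scheme are assumptions made for illustration and are not taken from the implementations compared in this paper; the only property that matters is that each worker and each search color sees its own fixed successor order.

```python
import random


def permuted_post(post, worker_id, phase):
    """Wrap `post` so that a given (worker, phase, state) always yields the
    same pseudo-random successor order; phase is e.g. "blue" or "red"."""
    def post_i(state):
        successors = list(post(state))
        # Seeding with a string is reproducible across runs, unlike hash() of a str.
        rng = random.Random(f"{worker_id}/{phase}/{state!r}")
        rng.shuffle(successors)
        return successors
    return post_i


# Hypothetical usage: worker 3 gets its own blue and red successor orders.
graph = {0: [1, 2, 3, 4], 1: [], 2: [], 3: [], 4: []}
post = lambda s: graph.get(s, [])
post_blue_3 = permuted_post(post, 3, "blue")
post_red_3 = permuted_post(post, 3, "red")
```

A swarm then simply runs one sequential NDFS per worker with its own wrapped post function and stops all workers once the first one reports a cycle.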

Although Swarmed NDFS is expected to be profitable for bug hunting, it does not show a speedup in
the absence of accepting cycles, in which case all workers have to go through the complete state
space. Indeed, the worst-case complexity of all parallel NDFS variations in this paper is
$O(|\to| \cdot N)$, i.e. linear both in the size of the Büchi automaton and in the number of workers.

   1  proc lndfs(s, N)
   2    dfs_blue(s, 1) || .. || dfs_blue(s, N)
   3    report no cycle

   4  proc dfs_red(s, i)
   5    s.color[i] := pink
   6    for all t in post_i^r(s) do
   7      if t.color[i] = cyan
   8        report cycle & exit all
   9      if t.color[i] ≠ pink ∧ ¬t.red
  10        dfs_red(t, i)
  11    if s ∈ A
  12      s.count := s.count − 1
  13      await s.count = 0
  14    s.red := true

  15  proc dfs_blue(s, i)
  16    allred := true
  17    s.color[i] := cyan
  18    for all t in post_i^b(s) do
  19      if t.color[i] = cyan and (s ∈ A ∨ t ∈ A)
  20        report cycle & exit all
  21      if t.color[i] = white ∧ ¬t.red
  22        dfs_blue(t, i)
  23      if ¬t.red
  24        allred := false
  25    if allred
  26      s.red := true
  27    else if s ∈ A
  28      s.count := s.count + 1
  29      dfs_red(s, i)
  30    s.color[i] := blue

Figure 2: The LNDFS algorithm, pruning blue and red DFS by a global red color, adapted from [13].

In order to improve average speedup, some more synchronization between the workers is needed. 
Note that a naive global sharing of colors between multiple workers would be incorrect, because it 
would destroy the post-order properties on which Ndfs relies. Next, we discuss two recent proposals 
for sharing information between the NDFS workers. 



2.3 LNDFS: Sharing the Red Color Globally 

The basic idea behind LNDFS in Alg. 2 is to share information in the backtrack of the red DFSs [13].
A new pink color is introduced at l.5 to signify states on the stack of a red DFS, analogous to cyan
for a blue DFS. The cyan, blue and pink colors are all local to worker $i$, but the red color is
shared globally. On backtracking from the red DFS, states are colored red at l.14. These red states
are ignored by all blue and red DFSs (l.21 and l.9), thus pruning the search space for all workers.
To improve pruning during the blue search, the number of red states is increased further by the
allred extension from [9] (l.16 and l.23-26).

To ensure correctness, it is necessary to synchronize the red coloring of accepting states (see
l.13). Otherwise, the algorithm is incorrect for more than two workers (see [13], which provides a
correctness proof for N > workers). Scalability of the LNDFS algorithm could be hampered by the need
for synchronization, but waiting is only needed when multiple workers start a red search from the
same accepting state; this does not happen often in practice. Another reason for limited scalability
is that work is only pruned when states can be marked red. Despite the allred extension, for input
graphs with no (or very few) accepting states, all workers still have to traverse the whole graph.
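
To make the interplay of local colors, the shared red set and the await step concrete, here is a small threaded Python sketch that follows Alg. 2 line by line. It is an illustration under several assumptions, not the LTSmin code: successor orders are rotated by worker id as a cheap stand-in for random permutations (the same order is used for blue and red), the await is a polling loop, and Python threads give no real parallel speedup.

```python
import threading
import time


class _CycleFound(Exception):
    pass


def lndfs(post, accepting, init, n_workers=4):
    """Return True iff some worker finds an accepting cycle reachable from init."""
    red = set()                 # globally shared red color
    count = {}                  # s.count: unfinished red searches started from s
    lock = threading.Lock()     # protects updates of count
    found = threading.Event()   # implements "report cycle & exit all"

    def check_abort():
        if found.is_set():
            raise _CycleFound

    def report_cycle():
        found.set()
        raise _CycleFound

    def worker(i):
        color = {}              # worker-local colors; a missing key means white

        def post_i(s):          # rotation by worker id (assumed permutation)
            succ = list(post(s))
            k = i % len(succ) if succ else 0
            return succ[k:] + succ[:k]

        def dfs_red(s):
            color[s] = "pink"                                # l.5
            for t in post_i(s):
                check_abort()
                if color.get(t) == "cyan":                   # l.7
                    report_cycle()
                if color.get(t) != "pink" and t not in red:  # l.9
                    dfs_red(t)
            if s in accepting:                               # l.11
                with lock:
                    count[s] -= 1
                while count[s] != 0:                         # l.13: await
                    check_abort()
                    time.sleep(0)
            red.add(s)                                       # l.14

        def dfs_blue(s):
            allred = True
            color[s] = "cyan"
            for t in post_i(s):
                check_abort()
                if color.get(t) == "cyan" and (s in accepting or t in accepting):
                    report_cycle()                           # l.19-20
                if t not in color and t not in red:          # l.21
                    dfs_blue(t)
                if t not in red:                             # l.23
                    allred = False
            if allred:
                red.add(s)                                   # l.26
            elif s in accepting:
                with lock:
                    count[s] = count.get(s, 0) + 1           # l.28
                dfs_red(s)
            color[s] = "blue"                                # l.30

        try:
            dfs_blue(init)
        except _CycleFound:
            pass

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return found.is_set()
```

For example, `lndfs(lambda s: {0: [1], 1: [2], 2: [0]}.get(s, []), {1}, 0)` explores a three-state loop through an accepting state with four workers.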



2.4 ENDFS: an Optimistic Approach with Repair Strategy

The basic idea of ENDFS in Alg. 3 [7] is to share both the blue and the red colors globally; only
the cyan and pink colors are local per worker. We deviate from the description in [7] by adding a
cyan stack and early cycle detection as optimizations, because this enables a fair comparison with
LNDFS. Consequently, we also renamed the local colors.

   1  proc endfs(s, N)
   2    dfs_blue(s, 1) || .. || dfs_blue(s, N)
   3    report no cycle

   4  proc dfs_red(s, i)
   5    s.pink[i] := true
   6    R_i := R_i ∪ {s}
   7    for all t in post_i^r(s) do
   8      if t.cyan[i]
   9        report cycle & exit all
  10      if t ∈ A ∧ ¬t.red
  11        t.dangerous := true
  12      if ¬t.red ∧ ¬t.pink[i]
  13        dfs_red(t, i)

  14  proc dfs_blue(s, i)
  15    s.cyan[i] := true
  16    for all t in post_i^b(s) do
  17      if t.cyan[i] and (s ∈ A ∨ t ∈ A)
  18        report cycle & exit all
  19      if ¬t.cyan[i] ∧ ¬t.blue
  20        dfs_blue(t, i)
  21    s.cyan[i] := false
  22    s.blue := true
  23    if s ∈ A
  24      R_i := ∅
  25      dfs_red(s, i)
  26      for all r in R_i do
  27        if ¬r.dangerous ∨ s = r
  28          r.red := true
  29      if s.dangerous
  30        ndfs(s, i)

Figure 3: The optimistic ENDFS algorithm, marking dangerous states, adapted from [7].

Sharing the blue color can lead to problems, as the post-order is not preserved by the algorithm.
ENDFS optimistically proceeds, but if it encounters accepting states that are not yet red during the
red search, they are marked dangerous at l.11. Eventually, dangerous states are double-checked in a
repair stage, by a separate sequential NDFS using worker-local colors only, at l.29-30. Note that
for technical reasons, states are not colored red during backtracking, but just collected in the
thread-local set $R_i$ at l.6. Only after termination of the red DFS are they made red (provided
they are not dangerous) at l.26-28.

Scalability of the ENDFS algorithm could be hampered by the repair stage, because this proceeds
sequentially. Also, marking states red occurs relatively late, potentially leading to more duplicate
work within the red DFS.



2.5 A Combined Version: New MC-NDFS 

We have recapitulated two very recent multi-core NDFS algorithms, which both seem to have their
merits and pitfalls. ENDFS, in the end, resorts to a sequential repair strategy, but it avoids some
work duplication due to the global blue color. LNDFS does not need a repair strategy, but the blue
DFS is only pruned when there are sufficiently many red states, and the algorithm may have to wait
for synchronization.

A simple idea suggests itself here: we could combine the two algorithms and try to reconcile their
strong points. The idea is simply to run the optimistic algorithm Alg. 3, but when dangerous states
are encountered at l.30, we call the parallel algorithm LNDFS (rather than NDFS).

We expect an improved speedup, because using ENDFS ensures good work sharing, even in the absence
of accepting states, while using LNDFS parallelizes the repair strategy, avoiding the important
sequential bottleneck of ENDFS. In the actual implementation, we also used a simple load balancing
strategy: when a worker finishes ENDFS, it starts helping other workers still in their LNDFS repair
phase.

2.6 One-Way-Catch-Them-Young with Maximal Accepting Predecessors

In the next section, we will compare the performance of the various NDFS implementations in terms of
their absolute timing and speedup behavior. We will also compare them with the current
state-of-the-art algorithm in parallel symbolic model checking, Owcty-Map [4] by Barnat et al.,
which is a member of the branch of BFS-based algorithms.

Basically, it extends the One-Way-Catch-Them-Young algorithm (OwCTY [21]) with an initialization
phase incorporated from the Maximal-Accepting-Predecessor algorithm (Map [5]). In a nutshell, Map
iteratively propagates unique node identifiers to successors. As soon as an accepting state receives
its own identifier, a cycle is detected. OwCTY is based on topological sort and iteratively
eliminates states that cannot lie on an accepting cycle, because they have no predecessors.

These algorithms are generally based on BFS, which is easier to parallelize than DFS. However,
these algorithms sacrifice linear-time behavior and the on-the-fly property. The resulting
combination is linear-time for Büchi automata generated from the class of weak LTL properties, and
shows on-the-fly behavior for several cases.

3 Experiments 

We implemented multi-core Swarmed NDFS, Alg. 2 and Alg. 3 in the multi-core backend of the LTSmin
model checking tool suite [16, 13, 14].¹ We performed experiments on an AMD Opteron 8356 16-core
(4x4 cores) server with 64 GB RAM, running a patched Linux 2.6.32 kernel. All tools were compiled
using gcc 4.4.3 in 64-bit mode with high compiler optimizations (-O3).

We measured performance characteristics for all 453 models with properties of the Beem database
ifTTl and compared the runs with the best known parallel LTL model checking algorithm, Owcty-Map,
as implemented in DiVinE 2.5 [2, 3]. In fact, we used the latest release available from the
development repository on 23 March 2011, which was close to the 2.5 version, except for a few
relevant bug fixes.

Note that Owcty-Map has been implemented in DiVinE, whereas all NDFS-based algorithms have been
implemented in LTSmin. This should be taken into account when comparing absolute runtimes. LTSmin
implements a generic interface around the fast implementation of the post function of DiVinE,
resulting in sequential runtimes that can be twice as long. On the other hand, LTSmin internally
uses shared hash tables, which are shown to scale better, at least for reachability [15].

To account for the random nature of the algorithms, all experiments were executed a total of 5 times. 
The data presented in the following subsections reflect the average over those 5 experiments. 

3.1 ENdfs Benchmarks 

Evangelista et al. [7] used workload distribution measurements to estimate the scalability of ENDFS.
Fig. 5 reflects their estimated speedups (the exact numbers were extracted from [8], which provides
experiments for more models, but shows equal numbers to those reported in [7]). Fig. 4 shows the
speedups that we obtained by measuring real runtimes of the algorithm.

A comparison with the estimated speedups shows that the trend of the lines has been accurately
predicted in most cases. A case-by-case comparison shows, however, that there is some divergence
between the exact numbers: models that scale well in the "synthetic" benchmarks of Fig. 5, for
example anderson.6.prop4, elevator2.3.prop4, leader_election.6.prop2 and szymanski.4.prop4, do not
scale well in practice. We have not investigated the source of these differences, but apparently the
number of dangerous states is quite sensitive to implementation parameters.

¹ Available on the LTSmin website: http://fmt.cs.utwente.nl/tools/ltsmin/

Figure 4: Measured speedups for ENDFS.

Figure 5: Estimated speedups for ENDFS from [7].

Fig. 6 and Fig. 7 compress the results from all models of the Beem database in log-log scatter
plots. In both figures, we show models without accepting cycles as dots and models with these cycles
as crosses. Comparing ENDFS to NDFS in the first figure, we can distinguish good speedups for the
models with cycles, while the other figure shows that ENDFS even improves the results of Swarmed
NDFS a little. In Section 4, we investigate and compare these effects more thoroughly, using a
statistical reference model for random parallel search. As for the models without accepting cycles,
we see that most do scale with ENDFS, but hardly beyond a speedup of 10. Even though theoretically
possible, we identified no cases where the repair strategy of ENDFS yields slowdowns (in the worst
case, all workers can traverse the state space 4 times).

Figure 8: State space coverage of ENDFS repair.

Figure 9: Cumulative extra work due to repair.

We also investigated what caused some inputs to scale poorly. Fig. 8 shows the percentage of the
state space that is covered by the repair procedure. As expected, a high percentage was measured for
all models with poor scalability. Fig. 9 shows the cumulative additional work performed by all
workers, by summing up the states visited by all workers in the repair procedure and dividing by the
total number of states ($|\mathcal{S}|$). It is worrisome that the need for repair can increase
faster than the number of cores. This suggests that ENDFS may not scale to many-core systems.



3.2 ENdfs versus LNdfs 



Fig. 10 shows the speedups of the LNDFS algorithm. In this set of models, few scale well with this
algorithm. The flat lines represent models with relatively few states reachable from accepting
states. In these cases, the algorithm can only color few states red, thus limiting work sharing
between the workers. As shown in [13], the fraction of red states is indeed directly related to the
speedup that is obtained. The two models leader_filters.7.prop2 and leader_election.6.prop2 have
state spaces that are colored entirely red, and hence exhibit almost ideal linear speedups. However,
Fig. 12 shows that only few models behave this ideally. Unfortunately, in [13] we reported better
speedups, which we have now tracked down to an implementation error that led to too many red states.

Figure 10: Speedups LNDFS.

Figure 11: Speedups NMC-NDFS.

Figure 12: NDFS vs LNDFS.

Figure 13: ENDFS vs LNDFS.

When comparing ENDFS to LNDFS in Fig. 13, we witness a few ties (on the thick line), a few winners
with LNDFS and by far the most winners with ENDFS. We looked up the models that draw a tie and found
that all of them scale with both algorithms. These are therefore not in need of improvements. Most
interestingly, the models that scale well with LNDFS correspond to those that do not scale with
ENDFS. This indicates that both algorithms are complementary, which is indeed to be expected,
because the same accepting states that cause states to be colored red in LNDFS are potentially
marked dangerous in ENDFS. This motivated their combination as described in Section 2.5.



3.3 NMC-NDFS Benchmarks 

In this subsection, we investigate our proposal for the combination of ENDFS and LNDFS into
NMC-NDFS. Fig. 11 shows that NMC-NDFS improves upon the speedups of ENDFS (see Fig. 4), and Fig. 14
confirms that all models scale well with the combined algorithm.

For NMC-NDFS, again, we also calculated the cumulative additional work as a percentage of the state
space in Fig. 16. The state space coverage by the repair procedure is almost equal to that of ENDFS
in Fig. 8. We can then deduce that the repair work is parallelized well by LNDFS, because the
cumulative additional work is close to the percentage of state space coverage. This can be explained
by the fact that LNDFS is always called on a (dangerous) accepting state in NMC-NDFS, which
eventually leads to a red coloring of the entire subgraph reachable from this accepting state. Under
these conditions LNDFS can be expected to scale well.

We also checked whether the new combination causes additional overhead, by comparing it directly
with its predecessors in Fig. 18 and Fig. 19. The first figure shows that no model runs faster with
ENDFS than with NMC-NDFS, although in a few examples LNDFS wins, as can be seen in the latter
figure. This confirms that LNDFS and ENDFS are complementary and that their combination represents
the best of both worlds. Indeed, the combination ensures that for all inputs some speedup is
obtained.

Figure 16: Cumulative extra work due to NMC-NDFS repair.

Figure 17: Speedups OwCTY.

3.4 Parallel NDFS versus OWCTY-MAP



Fig. 15 compares NMC-NDFS with Owcty-Map. The comparison shows that the heuristic on-the-fly method
of Owcty-Map is no match for the truly on-the-fly parallel NDFS algorithms. As for the models
without accepting cycles, we can conclude that currently NMC-NDFS provides a good match for
Owcty-Map, in particular for the larger models. For the sake of completeness, we also present
Fig. 20 and Fig. 21, which show a comparison between ENDFS/LNDFS and Owcty-Map. Furthermore, Fig. 17
shows the absolute speedups of Owcty-Map using the sequential NDFS runtimes as the base case.

Figure 18: ENDFS vs NMC-NDFS.

Figure 19: LNDFS vs NMC-NDFS.

(Both plots: logarithmic axes from 1.E-02 to 1.E+04; axis label: NMCNdfs (16 cores).)

Figure 20: Owcty-Map vs ENDFS.

Figure 21: Owcty-Map vs LNDFS.

4 Discussion on Parallel Random Search

As explained in Section 2, the multi-core NDFS algorithms use a randomized post function to direct
workers to different regions of the state space. In this section, we want to explain the speedup for
models with accepting cycles. In particular, we want to distinguish the effect of parallel random
search from the effect of the clever work sharing algorithms.

Our starting point is a simple statistical model as found in [12]. We view NDFS($\mathcal{B}, X$) as
an algorithm that runs on Büchi automaton $\mathcal{B}$ with random seed $X$, influencing the order
of traversing successors. We ran NDFS($\mathcal{B}, X$) 500 times with random $X$ on a number of
Büchi automata $\mathcal{B}$. Each time, we measured $t(\mathcal{B}, X)$, the time that it takes for
NDFS($\mathcal{B}, X$) to detect an accepting cycle.

In Figure 22 we show the cumulative probability $F(\mathcal{B}, t)$ that one NDFS worker will detect
an accepting cycle in less than $t$ seconds for some examples from the Beem database. We can also
define $F_N(\mathcal{B}, t)$ as the cumulative probability that a swarm of $N$ independent workers
will find an accepting cycle within $t$ seconds. Figure 23 shows $F_{16}(\mathcal{B}, t)$ for the
same automata. We also computed the expected time to completion and the standard deviation. The new
distribution can be easily computed as:

$$F_N(\mathcal{B}, t) = 1 - (1 - F(\mathcal{B}, t))^N$$
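
As a worked illustration of this formula, the Python sketch below derives the swarm distribution and the expected completion time of $N$ independent workers from a list of measured single-worker detection times. The function names and the sample values are hypothetical; the empirical distribution uses "at most $t$" rather than "less than $t$", which only matters at the sample points themselves.

```python
def swarm_cdf(samples, n_workers, t):
    """Empirical F_N(t) = 1 - (1 - F(t))^N from single-worker detection times."""
    f = sum(1 for x in samples if x <= t) / len(samples)
    return 1.0 - (1.0 - f) ** n_workers


def expected_swarm_time(samples, n_workers):
    """Expected time of the fastest of n_workers independent draws from the
    empirical distribution of `samples`."""
    xs = sorted(samples)
    n = len(xs)
    expected = 0.0
    for k, x in enumerate(xs, start=1):
        # Probability that the minimum of the draws equals the k-th smallest sample.
        p_min_is_x = (1 - (k - 1) / n) ** n_workers - (1 - k / n) ** n_workers
        expected += x * p_min_is_x
    return expected


# Hypothetical detection times (seconds) of four sequential runs.
times = [1.0, 2.0, 3.0, 4.0]
speedup_2 = expected_swarm_time(times, 1) / expected_swarm_time(times, 2)
```

The ratio of the expected times for 1 and $N$ workers is the speedup that pure random parallel search can be expected to deliver under this model.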




Figure 22: Cumulative probability distribution of finding a bug (measured for 1 worker). Axis
label: max runtime (sec).

Figure 23: Cumulative probability distribution of finding a bug (calculated for 16 workers). Axis
label: max runtime (sec).

From Fig. 22 and Fig. 23 we observe that considerable gains can be expected from a simple
parallelization as in Swarmed NDFS. They also show that the actual speedup depends highly on the
model: when all runs find an accepting cycle in about the same time (indicated by plateaus connected
by a steep curve), the expected gain is much less than when the curve is flatter, as is the case for
anderson.8.prop3, bakery.8.prop4 and peterson.6.prop4.

Next, we want to compare our actual implementation with these predictions. To this end, we compared
the expected completion times with actual completion times, averaged over 5 runs. We collected this
information in Table 1. In the first two columns (Statistical model), we copied the averages from
Fig. 22 and Fig. 23 for 1 and 16 workers, and computed the expected speedup. Note that this speedup
for 16 workers is well below 16. Next, we experimented with four different scenarios described below.

The next column (Distributed) corresponds to Swarmed NDFS as it would run on different machines in
a grid. Here the only synchronization would be to terminate all workers as soon as the first worker
has detected a cycle. The runtimes denote the completion time for the earliest run out of 16
independent workers; we again provide the average from 5 experiments. The corresponding speedups
closely match the ones predicted by the statistical model.



Table 1: Runtimes and speedups of bug hunting using embarrassingly parallel (randomized) NDFS and
LNDFS. The first two columns of the table present the expected completion time derived from 500
sequential experiments for 1 and 16 cores. The other columns give parallel runtimes for,
respectively, a distributed implementation, our randomized shared-memory implementation [13], and
another shared-memory implementation using the fresh successor heuristic. The second row gives the
speedups.

Next, we ran the experiments on the multi-core machine with 16 cores described before. Now the
workers share the basic infrastructure. This is the same setting as the multi-core Swarmed Ndfs from 
the previous section. For instance, all states will be stored only once in a shared hash table. Also, 
several workers now share information in the L2 cache. On the other hand, they might now suffer from 
cache coherence overhead or memory bus contention. The figures under "Shared Memory" show that 
the speedups in a multi-core environment are slightly better than on independent machines (Distributed). 

On multi-core machines it becomes easier to share information, in order to guide different workers 
into different parts of the state space. In that case, one would expect better speedup figures. We did an 
experiment with what we call the fresh successor heuristic. Here a worker will randomly select a globally 
unvisited successor if that exists, otherwise it randomly selects any successor. As the column Heuristic 
shows, this can dramatically improve the speedup of 16 workers. In some cases, each time an accepting 
cycle was found in such a small instant that a meaningful speedup figure could not be computed. 
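
The selection rule of the fresh successor heuristic can be sketched in a few lines of Python. The function `pick_successor` and the representation of the globally visited states as a plain set are assumptions made for illustration; in a real multi-core setting the visited set would be the shared hash table.

```python
import random


def pick_successor(successors, visited, rng=random):
    """Pick a random globally unvisited successor if one exists,
    otherwise a random successor; None if there are no successors."""
    successors = list(successors)
    if not successors:
        return None
    fresh = [t for t in successors if t not in visited]
    return rng.choice(fresh or successors)


# Hypothetical example: states 1 and 2 were already visited by some worker.
choice = pick_successor([1, 2, 3], visited={1, 2}, rng=random.Random(0))
```

Because the preference is only a tie-breaking rule on the successor order, it does not change which states are reachable, only which region a worker enters first.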

Finally, using LNDFS, the total amount of work is decreased, because workers prune each other's
search space. Again, we experimented with two versions, which are shown in the two right-most
columns. We computed the average runtime of 5 experiments on 16 cores with the random shared-memory
implementation. Note that this is the implementation that was used in all previous experiments in
Section 3. The figures again show a big improvement over Swarmed NDFS, even on a multi-core machine.
Interestingly, the fresh successor heuristic also works very well for the LNDFS algorithm, speeding
up the algorithm several times. Similar findings hold for all other parallel NDFS versions in this
paper, because they behave similarly on models with accepting cycles (see Fig. 18 and Fig. 19).



5 Conclusion 

In this paper, we experimentally compared two recent parallel NDFS-based algorithms, ENDFS [7] and
LNDFS [13]. We also compared them with Swarmed NDFS and with the BFS-based algorithm Owcty-Map [4].
We now summarize the conclusions from our experiments.

For systems with bugs (accepting cycles), both ENDFS and LNDFS outperform Owcty-Map by a large
margin, so they fully enjoy the on-the-fly property. We have also shown that for these cases ENDFS
and LNDFS perform much better than parallel random search, as in Swarmed NDFS.

On examples without bugs, it appears that ENDFS beats LNDFS in most of the cases, due to the fact 
that there are still too few red states to prune the blue search in LNDFS. However, in a number of other 
cases ENdfs still scales rather badly, due to the fact that the sequential repair strategy traverses large 
parts of the state space. Interestingly, it is possible to use the parallel LNdfs algorithm as the repair 
strategy of ENdfs. For this new combined algorithm, all examples of the Beem database showed a 
decent speedup. 

On examples without bugs, Owcty-Map beats both LNDFS and ENDFS in a majority of the cases, 
but still it is slower on a number of other examples. The combination of ENDFS and LNDFS, however, 
provided a good match for Owcty-Map, especially for the larger inputs. This shows that the new branch 
of parallel Ndfs algorithms is rather promising. 

Future work. We believe that the last word on parallel LTL model checking has not been said yet.
Although all NDFS versions have been implemented in the same framework so that we compare the
algorithmic differences, Owcty-Map was implemented in the DiVinE tool. We note that our computation
of the post function uses the same code from DiVinE. A reimplementation of Owcty-Map using shared
hash tables will probably increase its speedup, as indicated by results on pure reachability [15].

Also the young branch of parallel NDFS algorithms can still be improved. We have already shown that
adding heuristics to direct workers into different regions of the graph can greatly increase the
performance, at least for models with bugs. An interesting question is whether there exists a
correct variation on parallel NDFS that can fully share global information from both the blue and
the red search, without the need to resort to a repair strategy. This would take away the current
weak points of both ENDFS and LNDFS.